{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bag of Words Text Classifier\n",
    "\n",
    "The code below implements a simple bag of words text classifier.\n",
    "- We tokenize the text, create a vocabulary and encode each piece of text in the dataset\n",
    "- The lookup allows for extracting embeddings for each tokenized inputs\n",
    "- The embedding vectors are added together with a bias vector\n",
    "- The resulting vector is referred to as the scores\n",
    "- The score are applied a softmax to generate probabilities which are used for the classification task\n",
    "\n",
    "The code used in this notebook was inspired by code from the [official repo](https://github.com/neubig/nn4nlp-code) used in the [CMU Neural Networks for NLP class](http://www.phontron.com/class/nn4nlp2021/schedule.html) by [Graham Neubig](http://www.phontron.com/index.php). \n",
    "\n",
    "We are also adding a PyTorch data loader to this notebook which is how it differs from `bow.ipynb`.\n",
    "\n",
    "![img txt](../img/bow.png?raw=true)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import random\n",
    "import torch.nn as nn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "\n",
    "# download the files\n",
    "!wget https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/classes/dev.txt\n",
    "!wget https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/classes/test.txt\n",
    "!wget https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/classes/train.txt\n",
    "\n",
    "# create the data folders\n",
    "!mkdir data data/classes\n",
    "!cp dev.txt data/classes\n",
    "!cp test.txt data/classes\n",
    "!cp train.txt data/classes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [],
   "source": [
    "# function to read in data, process each line and split columns by \" ||| \"\n",
    "def read_data(filename):\n",
    "    data = []\n",
    "    with open(filename, 'r') as f:\n",
    "        for line in f:\n",
    "            line = line.lower().strip()\n",
    "            line = line.split(' ||| ')\n",
    "            data.append(line)\n",
    "    return data\n",
    "\n",
    "train_data = read_data('data/classes/train.txt')\n",
    "test_data = read_data('data/classes/test.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Construct the Vocab and Datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [],
   "source": [
    "# creating the word and tag indices\n",
    "word_to_index = {}\n",
    "word_to_index[\"<unk>\"] = len(word_to_index) # adds <UNK> to dictionary\n",
    "tag_to_index = {}\n",
    "\n",
    "# create word to index dictionary and tag to index dictionary from data\n",
    "def create_dict(data, check_unk=False):\n",
    "    for line in data:\n",
    "        for word in line[1].split(\" \"):\n",
    "            if check_unk == False:\n",
    "                if word not in word_to_index:\n",
    "                    word_to_index[word] = len(word_to_index)\n",
    "            else:\n",
    "                if word not in word_to_index:\n",
    "                    word_to_index[word] = word_to_index[\"<unk>\"]\n",
    "\n",
    "        if line[0] not in tag_to_index:\n",
    "            tag_to_index[line[0]] = len(tag_to_index)\n",
    "\n",
    "create_dict(train_data)\n",
    "create_dict(test_data, check_unk=True)\n",
    "\n",
    "# create word and tag tensors from data\n",
    "def create_tensor(data):\n",
    "    for line in data:\n",
    "        yield [[word_to_index[word] for word in line[1].split(\" \")], tag_to_index[line[0]]]\n",
    "\n",
    "train_data = [*create_tensor(train_data)]\n",
    "test_data = [*create_tensor(test_data)]\n",
    "\n",
    "number_of_words = len(word_to_index)\n",
    "number_of_tags = len(tag_to_index)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert data to PyTorch Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "from torch.utils.data import Dataset\n",
    "\n",
    "# load data into a dataset and dataloader; ensure that the data is split X, y\n",
    "class TextDataset(Dataset):\n",
    "    def __init__(self, data):\n",
    "        self.data = data\n",
    "\n",
    "    def __len__(self):\n",
    "        return len(self.data)\n",
    "\n",
    "    def __getitem__(self, idx):\n",
    "        return torch.as_tensor(self.data[idx][0]), torch.as_tensor(self.data[idx][1])\n",
    "\n",
    "train_dataset = TextDataset(train_data)\n",
    "test_dataset = TextDataset(test_data)\n",
    "\n",
    "train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)\n",
    "test_loader = DataLoader(test_dataset, batch_size=1, shuffle=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cpu or gpu\n",
    "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
    "\n",
    "# create a simple neural network with embedding layer, bias, and xavier initialization\n",
    "class BoW(torch.nn.Module):\n",
    "    def __init__(self, nwords, ntags):\n",
    "        super(BoW, self).__init__()\n",
    "        self.embedding = nn.Embedding(nwords, ntags)\n",
    "        nn.init.xavier_uniform_(self.embedding.weight)\n",
    "\n",
    "        type = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor\n",
    "        self.bias = torch.zeros(ntags, requires_grad=True).type(type)\n",
    "\n",
    "    def forward(self, x):\n",
    "        emb = self.embedding(x) # seq_len x ntags (for each seq) \n",
    "        out = torch.sum(emb, dim=0) + self.bias # ntags\n",
    "        out = out.view(1, -1) # reshape to (1, ntags)\n",
    "        return out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train the Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ITER: 1 | train loss/sent: 1.4731 | train accuracy: 0.3668 | test accuracy: 0.3778\n",
      "ITER: 2 | train loss/sent: 1.1223 | train accuracy: 0.6056 | test accuracy: 0.4118\n",
      "ITER: 3 | train loss/sent: 0.9106 | train accuracy: 0.7155 | test accuracy: 0.4186\n",
      "ITER: 4 | train loss/sent: 0.7685 | train accuracy: 0.7687 | test accuracy: 0.4032\n",
      "ITER: 5 | train loss/sent: 0.6635 | train accuracy: 0.8070 | test accuracy: 0.4054\n",
      "ITER: 6 | train loss/sent: 0.5814 | train accuracy: 0.8346 | test accuracy: 0.4113\n",
      "ITER: 7 | train loss/sent: 0.5157 | train accuracy: 0.8558 | test accuracy: 0.3991\n",
      "ITER: 8 | train loss/sent: 0.4631 | train accuracy: 0.8722 | test accuracy: 0.3946\n",
      "ITER: 9 | train loss/sent: 0.4183 | train accuracy: 0.8839 | test accuracy: 0.4014\n",
      "ITER: 10 | train loss/sent: 0.3807 | train accuracy: 0.8969 | test accuracy: 0.3928\n"
     ]
    }
   ],
   "source": [
    "# train and test the BoW model\n",
    "model = BoW(number_of_words, number_of_tags).to(device)\n",
    "criterion = nn.CrossEntropyLoss()\n",
    "optimizer = torch.optim.Adam(model.parameters())\n",
    "type = torch.LongTensor\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    model.to(device)\n",
    "    type = torch.cuda.LongTensor\n",
    "\n",
    "# perform training of the Bow model\n",
    "def train_bow(model, optimizer, criterion, train_data):\n",
    "    for ITER in range(10):\n",
    "        # perform training\n",
    "        model.train()\n",
    "        total_loss = 0.0\n",
    "        train_correct = 0\n",
    "        for batch, (sentence, tag) in enumerate(train_loader):\n",
    "            sentence = sentence[0].to(device)\n",
    "            tag = tag.to(device)\n",
    "\n",
    "            output = model(sentence)\n",
    "            predicted = torch.argmax(output.data.detach()).item()\n",
    "            \n",
    "            loss = criterion(output, tag)\n",
    "            total_loss += loss.item()\n",
    "\n",
    "            optimizer.zero_grad()\n",
    "            loss.backward()\n",
    "            optimizer.step()\n",
    "\n",
    "            if predicted == tag: train_correct+=1\n",
    "\n",
    "        # perform testing of the model\n",
    "        model.eval()\n",
    "        test_correct = 0\n",
    "        for batch, (sentence, tag) in enumerate(test_loader):\n",
    "            sentence = sentence[0].to(device)\n",
    "            output = model(sentence)\n",
    "            predicted = torch.argmax(output.data.detach()).item()\n",
    "            if predicted == tag: test_correct += 1\n",
    "        \n",
    "        # print model performance results\n",
    "        log = f'ITER: {ITER+1} | ' \\\n",
    "            f'train loss/sent: {total_loss/len(train_data):.4f} | ' \\\n",
    "            f'train accuracy: {train_correct/len(train_data):.4f} | ' \\\n",
    "            f'test accuracy: {test_correct/len(test_data):.4f}'\n",
    "        print(log)\n",
    "\n",
    "# call the train_bow function\n",
    "train_bow(model, optimizer, criterion, train_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercises\n",
    "\n",
    "To keep on practising, you can try the following exercises:\n",
    "\n",
    "- Try to use different batch sizes and see how it affects the training.\n",
    "- Try to use [`torchtext`](https://pytorch.org/text/stable/index.html#) to load other datasets and create tokenizer and vocabularies for them. This [example](https://pytorch.org/tutorials/beginner/transformer_tutorial.html) on Transformer could be useful to help guide you.\n",
    "- Write a mini Python library easily help you train and evaluate models. You can use the code from this notebook as a starting point."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "nlp",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.13"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "154abf72fb8cc0db1aa0e7366557ff891bff86d6d75b7e5f2e68a066d591bfd7"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
